Adapted from the singleSingleCell Bioconductor workflow by Aaron Lun, Davis McCarthy, and John Marioni.

1 Overview

Droplet-based scRNA-seq protocols capture cells in droplets for massively multiplexed library prepation (Klein et al. 2015; Macosko et al. 2015). This greatly increases the throughput of scRNA-seq studies, allowing tens of thousands of individual cells to be profiled in a routine experiment. However, it (again) involves some differences from the previous workflows to reflect some unique aspects of droplet-based data.

Here, we describe a brief analysis of the peripheral blood mononuclear cell (PBMC) dataset from 10X Genomics (Zheng et al. 2017). The data are publicly available from the 10X Genomics website, from which we download the raw gene/barcode count matrices, i.e., before cell calling from the CellRanger pipeline.

library(BiocFileCache)
bfc <- BiocFileCache("../raw_data", ask = FALSE)
raw.path <- bfcrpath(bfc, file.path("http://cf.10xgenomics.com/samples",
    "cell-exp/2.1.0/pbmc4k/pbmc4k_raw_gene_bc_matrices.tar.gz"))
untar(raw.path, exdir=file.path("../raw_data/pbmc4k"))

2 Setting up the data

2.1 Reading in a sparse matrix

We load in the raw count matrix using the read10xCounts() function from the DropletUtils package. This will create a SingleCellExperiment object where each column corresponds to a cell barcode.

library(DropletUtils)
fname <- file.path("../raw_data/pbmc4k/raw_gene_bc_matrices/GRCh38")
sce <- read10xCounts(fname, col.names=TRUE)
sce
## class: SingleCellExperiment 
## dim: 33694 737280 
## metadata(0):
## assays(1): counts
## rownames(33694): ENSG00000243485 ENSG00000237613 ... ENSG00000277475
##   ENSG00000268674
## rowData names(2): ID Symbol
## colnames(737280): AAACCTGAGAAACCAT-1 AAACCTGAGAAACCGC-1 ...
##   TTTGTCATCTTTAGTC-1 TTTGTCATCTTTCCTC-1
## colData names(2): Sample Barcode
## reducedDimNames(0):
## spikeNames(0):

Here, each count represents the number of unique molecular identifiers (UMIs) assigned to a gene for a cell barcode. Note that the counts are loaded as a sparse matrix object - specifically, a dgCMatrix instance from the Matrix package. This avoids allocating memory to hold zero counts, which is highly memory-efficient for low-coverage scRNA-seq data.

class(counts(sce))
## [1] "dgCMatrix"
## attr(,"package")
## [1] "Matrix"

2.2 Annotating the rows

We relabel the rows with the gene symbols for easier reading. This is done using the uniquifyFeatureNames() function, which ensures uniqueness in the case of duplicated or missing symbols.

library(scater)
rownames(sce) <- uniquifyFeatureNames(rowData(sce)$ID, rowData(sce)$Symbol)
head(rownames(sce))
## [1] "RP11-34P13.3"  "FAM138A"       "OR4F5"         "RP11-34P13.7" 
## [5] "RP11-34P13.8"  "RP11-34P13.14"

We also identify the chromosomal location for each gene. The mitochondrial location is particularly useful for later quality control.

library(EnsDb.Hsapiens.v86)
location <- mapIds(EnsDb.Hsapiens.v86, keys=rowData(sce)$ID, 
    column="SEQNAME", keytype="GENEID")
rowData(sce)$CHR <- location

3 Calling cells from empty droplets

3.1 Testing for deviations from ambient expression

An interesting aspect of droplet-based data is that we have no prior knowledge about which droplets (i.e., cell barcodes) actually contain cells, and which are empty. Thus, we need to call cells from empty droplets based on the observed expression profiles. This is not entirely straightforward as empty droplets can contain ambient (i.e., extracellular) RNA that can be captured and sequenced. The distribution of total counts exhibits a sharp transition between barcodes with large and small total counts (Figure 1), probably corresponding to cell-containing and empty droplets respectively.

bcrank <- barcodeRanks(counts(sce))

# Only showing unique points for plotting speed.
uniq <- !duplicated(bcrank$rank)
plot(bcrank$rank[uniq], bcrank$total[uniq], log="xy",
    xlab="Rank", ylab="Total UMI count", cex.lab=1.2)

abline(h=metadata(bcrank)$inflection, col="darkgreen", lty=2)
abline(h=metadata(bcrank)$knee, col="dodgerblue", lty=2)

legend("bottomleft", legend=c("Inflection", "Knee"), 
    col=c("darkgreen", "dodgerblue"), lty=2, cex=1.2)
Total UMI count for each barcode in the PBMC dataset, plotted against its rank (in decreasing order of total counts). The inferred locations of the inflection and knee points are also shown.

Figure 1: Total UMI count for each barcode in the PBMC dataset, plotted against its rank (in decreasing order of total counts)
The inferred locations of the inflection and knee points are also shown.

We use the emptyDrops() function to test whether the expression profile for each cell barcode is significantly different from the ambient RNA pool (Lun et al. 2018). Any significant deviation indicates that the barcode corresponds to a cell-containing droplet. We call cells at a false discovery rate (FDR) of 0.1%, meaning that no more than 0.1% of our called barcodes should be empty droplets on average.

set.seed(100)
e.out <- emptyDrops(counts(sce))
sum(e.out$FDR <= 0.001, na.rm=TRUE)
## [1] 4233

We then subset our SingleCellExperiment object to retain only the detected cells.

# using which() to automatically remove NAs.
sce <- sce[,which(e.out$FDR <= 0.001)]

3.2 Examining cell-calling diagnostics

As mentioned above, emptyDrops() assumes that barcodes with very low total UMI counts are empty droplets. Thus, the null hypothesis should be true for all of these barcodes. We can check whether the hypothesis test holds its size by examining the distribution of \(p\)-values for low-total barcodes. Ideally, the distribution should be close to uniform.

full.data <- read10xCounts(fname, col.names=TRUE)
set.seed(100)
limit <- 100   
all.out <- emptyDrops(counts(full.data), lower=limit, test.ambient=TRUE)
hist(all.out$PValue[all.out$Total <= limit & all.out$Total > 0],
    xlab="P-value", main="", col="grey80") 
Distribution of $p$-values for the assumed empty droplets.

Figure 2: Distribution of \(p\)-values for the assumed empty droplets

Large peaks near zero indicate that barcodes with total counts below lower are not all ambient in origin. This can be resolved by decreasing lower further to exclude barcodes corresponding to droplets with very small cells.

4 Quality control on the cells

The previous step only distinguishes cells from empty droplets, but makes no statement about the quality of the cells. It is entirely possible for droplets to contain damaged or dying cells, which need to be removed prior to downstream analysis. We compute some QC metrics using calculateQCMetrics() (McCarthy et al. 2017) and examine their distributions in Figure 3.

sce <- calculateQCMetrics(sce, feature_controls=list(Mito=which(location=="MT")))
par(mfrow=c(1,3))
hist(sce$log10_total_counts, breaks=20, col="grey80",
    xlab="Log-total UMI count")
hist(sce$log10_total_features_by_counts, breaks=20, col="grey80",
    xlab="Log-total number of expressed features")
hist(sce$pct_counts_Mito, breaks=20, col="grey80",
    xlab="Proportion of reads in mitochondrial genes")
Histograms of QC metric distributions in the PBMC dataset.

Figure 3: Histograms of QC metric distributions in the PBMC dataset

Ideally, we would remove cells with low library sizes or total number of expressed features. However, this would likely remove cell types with low RNA content, especially in a heterogeneous PBMC population with many different cell types. Thus, we use a more relaxed strategy and only remove cells with large mitochondrial proportions, using it as a proxy for cell damage. (Keep in mind that droplet-based datasets usually do not have spike-in RNA.)

high.mito <- isOutlier(sce$pct_counts_Mito, nmads=3, type="higher")
sce <- sce[,!high.mito]
summary(high.mito)
##    Mode   FALSE    TRUE 
## logical    3922     311

5 Examining gene expression

The average expression of each gene is much lower here compared to the previous dataset (Figure 4). This is due to the reduced coverage per cell when thousands of cells are multiplexed together for sequencing.

ave <- calcAverage(sce)
rowData(sce)$AveCount <- ave
hist(log10(ave), col="grey80")
Histogram of the log~10~-average counts for each gene in the PBMC dataset.

Figure 4: Histogram of the log10-average counts for each gene in the PBMC dataset

The set of most highly expressed genes is dominated by ribosomal protein and mitochondrial genes (Figure 5), as expected.

plotHighestExprs(sce, n=25) + theme(text = element_text(size=14))
Percentage of total counts assigned to the top 25 most highly-abundant features in the PBMC dataset. For each feature, each bar represents the percentage assigned to that feature for a single cell, while the circle represents the average across all cells. Bars are coloured by the total number of expressed features in each cell.

Figure 5: Percentage of total counts assigned to the top 25 most highly-abundant features in the PBMC dataset
For each feature, each bar represents the percentage assigned to that feature for a single cell, while the circle represents the average across all cells. Bars are coloured by the total number of expressed features in each cell.

6 Normalizing for cell-specific biases

We normalize endogenous genes using the computeSumFactors() function with an additional pre-clustering step (Lun, Bach, and Marioni 2016). Cells in each cluster are normalized separately, and the size factors are rescaled to be comparable across clusters. This avoids the need to assume that most genes are non-DE across the entire population - only a non-DE majority is required between pairs of clusters. Scaling is then performed to ensure that size factors of cells in different clusters are comparable.

We speed up clustering by performing fast dimensionality reduction and then clustering cells on the PCs. This is the purpose of the BSPARAM= argument, which instructs quickCluster() to use a approximate algorithm for PCA1 Using methods from the irlba package.. The approximation relies on stochastic initialization so we need to set the random seed for reproducibility - see below for more detail.

library(scran)
library(BiocSingular)
set.seed(1000)
clusters <- quickCluster(sce, BSPARAM=IrlbaParam())
table(clusters)
## clusters
##   1   2   3   4   5   6   7   8   9  10  11  12  13 
## 160 194 433 390 219 226 855 418 297 327 113 158 132

We apply the deconvolution method to compute size factors for all cells (Lun, Bach, and Marioni 2016). The specification of cluster= also ensures that we do not pool cells that are very different. We use a average count threshold of 0.1 to define high-abundance genes to use during normalization. This is lower than the default threshold of min.mean=1, reflecting the fact that UMI counts are generally smaller than read counts.

sce <- computeSumFactors(sce, min.mean=0.1, cluster=clusters)
summary(sizeFactors(sce))
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
##  0.001487  0.706666  0.860268  1.000000  1.086259 11.695724

The size factors are well correlated against the library sizes (Figure 6), indicating that capture efficiency and sequencing depth are the major biases.

plot(sce$total_counts, sizeFactors(sce), log="xy")
Size factors for all cells in the PBMC dataset, plotted against the library size.

Figure 6: Size factors for all cells in the PBMC dataset, plotted against the library size

Finally, we compute normalized log-expression values. There is no need to call computeSpikeFactors() here, as there are no spike-in transcripts available.

sce <- normalize(sce)

7 Modelling the mean-variance trend

The lack of spike-in transcripts complicates the modelling of the technical noise. One option is to assume that most genes do not exhibit strong biological variation, and to fit a trend to the variances of endogenous genes (see here for details). However, this assumption is generally unreasonable for a heterogeneous population. Instead, we assume that the technical noise is Poisson and create a fitted trend on that basis using the makeTechTrend() function. Note that this tends to yield large biological components for highly-expressed genes for which Poisson noise is low (in the log-expression space). This often includes so-called “house-keeping” genes coding for essential cellular components such as ribosomal proteins. Although they can indeed be differentially expressed, their impact in the analaysis can be reduced by fitting the mean-variance trend to the endogenous genes.

new.trend <- makeTechTrend(x=sce)

We estimate the variances for all genes and compare the trend fits in Figure 7. The Poisson-based trend serves as a lower bound for the variances of the endogenous genes. This results in non-zero biological components for most genes, which is consistent with other UMI-based data sets (see the corresponding analysis of the Zeisel et al. (2015) data set).

fit <- trendVar(sce, use.spikes=FALSE, loess.args=list(span=0.05))
plot(fit$mean, fit$var, pch=16)
curve(fit$trend(x), col="dodgerblue", add=TRUE)
curve(new.trend(x), col="red", add=TRUE)
Variance of normalized log-expression values for each gene in the PBMC dataset, plotted against the mean log-expression. The blue line represents the mean-dependent trend fitted to the variances, while the red line represents the Poisson noise.

Figure 7: Variance of normalized log-expression values for each gene in the PBMC dataset, plotted against the mean log-expression
The blue line represents the mean-dependent trend fitted to the variances, while the red line represents the Poisson noise.

We decompose the variance for each gene using the Poisson-based trend, and examine the genes with the highest biological components.

fit$trend <- new.trend # overwrite trend.
dec <- decomposeVar(fit=fit) # use per-gene variance estimates in 'fit'.
top.dec <- dec[order(dec$bio, decreasing=TRUE),] 
head(top.dec)
## DataFrame with 6 rows and 6 columns
##                     mean            total              bio              tech
##                <numeric>        <numeric>        <numeric>         <numeric>
## LYZ     1.98067214731634 5.12675017749719 4.49570503819309 0.631045139304101
## S100A9  1.95576910152019 4.62353584276581 3.98858788463307 0.634947958132733
## S100A8   1.7244765771349 4.49362852275806 3.82909592857106 0.664532594186994
## HLA-DRA 2.10051754708266 3.73885530331417 3.12807677300468 0.610778530309484
## CD74     2.9074623077264 3.32923708020731 2.87761651579329 0.451620564414024
## CST3     1.4948953521548 2.97218277928397 2.29406052725509 0.678122252028884
##           p.value       FDR
##         <numeric> <numeric>
## LYZ             0         0
## S100A9          0         0
## S100A8          0         0
## HLA-DRA         0         0
## CD74            0         0
## CST3            0         0

We can plot the genes with the largest biological components, to verify that they are indeed highly variable (Figure 8).

plotExpression(sce, features=rownames(top.dec)[1:10])
Distributions of normalized log-expression values for the top 10 genes with the largest biological components in the PBMC dataset. Each point represents the log-expression value in a single cell.

Figure 8: Distributions of normalized log-expression values for the top 10 genes with the largest biological components in the PBMC dataset
Each point represents the log-expression value in a single cell.

We use denoisePCA() to choose the number of principal components (PCs) to retain based on the technical noise per gene. We need to set the seed for reproducibility when BSPARAM=IrlbaParam(), due to the use of randomized methods from irlba.

set.seed(12345)
sce <- denoisePCA(sce, technical=new.trend, BSPARAM=IrlbaParam())
ncol(reducedDim(sce))
## [1] 14

8 Doublet detection with clusters

In single-cell RNA sequencing (scRNA-seq) experiments, doublets are artifactual libraries generated from two cells. They typically arise due to errors in cell sorting or capture, especially in droplet-based protocols (Zheng et al. 2017) involving thousands of cells. Doublets are obviously undesirable when the aim is to characterize populations at the single-cell level. In particular, they can incorrectly suggest the existence of intermediate populations or transitory states that not actually exist. Thus, it is desirable to remove doublet libraries so that they do not compromise interpretation of the results. Note that doublet detection procedures should only be applied to libraries generated in the same experimental batch, since it is obviously impossible for doublets to form between two cells that were captured separately.

Several experimental strategies are available for doublet removal. One approach exploits natural genetic variation when pooling cells from multiple donor individuals (Kang et al. 2018). Doublets can be identified as libraries with allele combinations that do not exist in any single donor. Another approach is to mark a subset of cells (e.g., all cells from one sample) with an antibody conjugated to a different oligonucleotide (Stoeckius et al. 2017). Upon pooling, libraries that are observed to have different oligonucleotides are considered to be doublets and removed. These approaches can be highly effective but rely on experimental information that may not be available.

8.1 Clustering into subpopulations

We cluster cells into putative subpopulations using buildSNNGraph() (Xu and Su 2015). We use a higher k to increase connectivity and reduce the granularity of the clustering.

snn.gr <- buildSNNGraph(sce, use.dimred="PCA", k=10)
sce$Cluster <- factor(igraph::cluster_walktrap(snn.gr)$membership)
table(sce$Cluster)
## 
##   1   2   3   4   5   6   7   8   9  10  11  12  13 
##  57 786 591 517 516 196 132 799  23 138  86  45  36

We visualize the clustering on a t-SNE plot (Van der Maaten and Hinton 2008). Figure 9 shows that there are a number of well-separated clusters as well as some more inter-related clusters.

set.seed(1000)
sce <- runTSNE(sce, use_dimred="PCA")
plotTSNE(sce, colour_by="Cluster")
t-SNE plot of the PBMC data set. Each point is a cell coloured according to its assigned cluster identity.

Figure 9: t-SNE plot of the PBMC data set
Each point is a cell coloured according to its assigned cluster identity.

8.2 Identify intermediate clusters

The doubletCluster() function will identify clusters that have intermediate expression profiles of two other clusters (Bach et al. 2017). Specifically, it will examine every possible triplet of clusters consisting of a query cluster and its two “parents”. It will then compute a number of statistics:

  • The number of genes (N) that are differentially expressed in the same direction in the query cluster compared to both of the parent clusters. Such genes would be unique markers for the query cluster and provide evidence against the null hypothesis, i.e., that the query cluster consists of doublets from the two parents. Clusters with few unique genes are more likely to be doublets.
  • The ratio of the median library size in each parent to the median library size in the query (lib.size*). Doublet libraries are generated from a larger initial pool of RNA compared to libraries for single cells, and thus the former should have larger library sizes. Library size ratios much greater than unity are inconsistent with a doublet identity for the query.
  • The proportion of cells in the query cluster should also be reasonable - typically less than 5% of all cells, depending on how many cells were loaded onto the 10X Genomics device.
dbl.out <- doubletCluster(sce, sce$Cluster)
dbl.out
## DataFrame with 13 rows and 9 columns
##         source1     source2         N        best              p.value
##     <character> <character> <integer> <character>            <numeric>
## 10            9           8         0       RPLP1                    1
## 1            10           2         2       CXCR4 0.000920377742252691
## 9            12          10        15         MAX 1.25113504352617e-05
## 8            10           1        16        JUNB 5.62017734849775e-14
## 4             6           3        24        GZMK 2.71004919950737e-64
## ...         ...         ...       ...         ...                  ...
## 7            13           1       124    HLA-DPB1 3.73025048151389e-22
## 6             9           4       180        RHOC 2.88811837963742e-35
## 13            9           7       210    C12orf75 1.40785705810756e-26
## 11            9           1       296       MS4A7 1.52409512408007e-31
## 5            13          10       321       MS4A1 1.40144612485557e-88
##              lib.size1         lib.size2                prop
##              <numeric>         <numeric>           <numeric>
## 10   0.112939416604338  1.05185739217153  0.0351861295257522
## 1    0.612272935429705 0.601053274309266  0.0145334013258542
## 9    0.885209713024283  8.85430463576159 0.00586435492095869
## 8    0.950699217824129  1.55273761554871   0.203722590515043
## 4    0.841415830546265  1.27703455964326   0.131820499745028
## ...                ...               ...                 ...
## 7     0.59224049331963 0.561065433367592  0.0336562978072412
## 6    0.150049685326267  1.18847300430606  0.0499745028046915
## 13  0.0655097613882863   1.6885032537961 0.00917899031106578
## 11  0.0637713803054832 0.922221440135145  0.0219275879653238
## 5     1.95228684359119  1.13241106719368   0.131565527791943
##                                       all.pairs
##                                 <DataFrameList>
## 10         9:8:0:...,12:8:5:...,13:8:20:...,...
## 1         10:2:2:...,13:2:31:...,9:2:37:...,...
## 9     12:10:15:...,12:6:19:...,13:12:22:...,...
## 8       10:1:16:...,10:3:16:...,10:4:22:...,...
## 4         6:3:24:...,10:6:48:...,8:6:71:...,...
## ...                                         ...
## 7   13:1:124:...,13:11:172:...,13:2:178:...,...
## 6     9:4:180:...,11:4:210:...,12:4:244:...,...
## 13     9:7:210:...,12:7:251:...,9:1:278:...,...
## 11      9:1:296:...,9:7:297:...,6:1:319:...,...
## 5   13:10:321:...,10:1:328:...,13:8:359:...,...

Examination of the output of doubletCluster() indicates that cluster 1 has the fewest unique genes and library sizes that are comparable to or greater than its parents. We see that every gene detected in this cluster is also expressed in either of the two proposed parent clusters (Figure 10).

markers <- findMarkers(sce, sce$Cluster, direction="any")
dbl.markers <- markers[["1"]]
chosen <- rownames(dbl.markers)[dbl.markers$Top <= 10]
plotHeatmap(sce, columns=order(sce$Cluster), colour_columns_by="Cluster", 
    features=chosen, cluster_cols=FALSE, center=TRUE, symmetric=TRUE, 
    zlim=c(-3, 4), show_colnames=FALSE)
Heatmap of mean-centred and normalized log-expression values for the top set of markers for cluster 1 in the PBMC dataset. Column colours represent the cluster to which each cell is assigned, as indicated by the legend.

Figure 10: Heatmap of mean-centred and normalized log-expression values for the top set of markers for cluster 1 in the PBMC dataset
Column colours represent the cluster to which each cell is assigned, as indicated by the legend.

The strength of doubletCluster() lies in its simplicity and ease of interpretation. Suspect clusters can be quickly flagged for further investigation, based on the metrics returned by the function. However, it is obviously dependent on the quality of the clustering. Clusters that are too coarse will fail to separate doublets from other cells, while clusters that are too fine will complicate interpretation.

Also note that the output of doubletClusters() should be treated as a prioritization of “high-risk” clusters that require more careful investigation - no fixed threshold on any one of the metrics alone is sufficient.

9 Doublet detection by simulation

9.1 Background

The other doublet detection strategy involves in silico simulation of doublets from the single-cell expression profiles (Dahlin et al. 2018). This is performed using the doubletCells() function from scran, which will:

  1. Simulate thousands of doublets by adding together two randomly chosen single-cell profiles.
  2. For each original cell, compute the density of simulated doublets in the surrounding neighbourhood.
  3. For each original cell, compute the density of other observed cells in the neighbourhood.
  4. Return the ratio between the two densities as a “doublet score” for each cell.

This approach assumes that the simulated doublets are good approximations for real doublets. The use of random selection accounts for the relative abundances of different subpopulations, which affect the likelihood of their involvement in doublets; and the calculation of a ratio avoids high scores for non-doublet cells in highly abundant subpopulations. We see the function in action below. Note that to speed up the density calculations, doubletCells() will perform a PCA on the log-expression matrix. When BSPARAM=IrlbaParam(), methods from the irlba package are used to perform a fast approximate PCA. This involves randomization so it is necessary to call set.seed() to ensure that results are reproducible.

set.seed(100)
dbl.dens <- doubletCells(sce, BSPARAM=IrlbaParam())
summary(dbl.dens)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##  0.00000  0.02822  0.07562  0.19595  0.16417 14.54507

The highest doublet scores are concentrated in a single cluster of cells in the centre of Figure 11.

sce$DoubletScore <- dbl.dens
plotTSNE(sce, colour_by="DoubletScore")
t-SNE plot of the PBMC data set. Each point is a cell coloured according to its doublet density.

Figure 11: t-SNE plot of the PBMC data set
Each point is a cell coloured according to its doublet density.

From the clustering information, we see that the affected cells belong to the same cluster that was identified using doubletCluster() (Figure 12).

plotColData(sce, x="Cluster", y="DoubletScore", colour_by="Cluster")
Distribution of doublet scores for each cluster in the PBMC data set. Each point is a cell.

Figure 12: Distribution of doublet scores for each cluster in the PBMC data set
Each point is a cell.

9.2 Strengths and weaknesses

The advantage of doubletCells() is that it does not depend on clusters, reducing the sensitivity of the results to clustering quality. The downside is that it requires some strong assumptions about how doublets form, such as the combining proportions and the sampling from pure subpopulations. In particular, doubletCells() treats the library size of each cell as an accurate proxy for its total RNA content. If this is not true, the simulation will not combine expression profiles from different cells in the correct proportions. This means that the simulated doublets will be systematically shifted away from the real doublets, resulting in doublet scores that are too low for the latter.

Simply removing cells with high doublet scores will not be sufficient to eliminate real doublets from the data set. As we can see in Figure 12, only a subset of the cells in the putative doublet cluster actually have high scores. Removing these would still leave enough cells in that cluster to mislead downstream analyses. In fact, even defining a threshold on the doublet score is difficult as the interpretation of the score is relative. There is no general definition for a fixed threshold above which libraries are to be considered doublets.

We recommend interpreting the doubletCells() scores in the context of cluster annotation. All cells from a cluster with a large average doublet score should be considered suspect. Close neighbours of problematic clusters (e.g., identified by clusterModularity(), see here) should be treated with caution. A cluster containing a small proportion of high-scoring cells is generally safe for downstream analyses, provided that the user investigates any interesting results to ensure that they are not being driven by those cells2 For example, checking that DE in an interesting gene is not driven solely by cells with high doublet scores.. While clustering is still required, this approach is more robust than doubletClusters() to the quality of the clustering.

9.3 Remove identified doublet cells

Both approaches above implicated cluster 1 as having a high concentration of doublets. Before proceeding, we’ll remove the 57 cells in this cluster.

sce_singlets <- sce[,-which(sce$Cluster==1)]

10 Save object for later use

Having completed preprocessing and normalization, we save the SingleCellExperiment object with its associated data to file. This avoids having to repeat all of the pre-processing steps described above prior to further analyses.

saveRDS(sce_singlets, file="pbmc_postQC.rds")

11 Exercises

For these exercises, use the SingleCellExperiment object that hasn’t been filtered for doublets (sce).

11.1 What proportion of measurements are zero?

# your code here
sum(counts(sce)==0)/length(counts(sce))
## [1] 0.9598411

11.2 DESeq2 size factors

How many genes can be used in the calcualtion of DESeq2 size factors?

# your code here
sum(apply(counts(sce), 1, function(x) all(x>0)))
## [1] 0

11.3 Compare empty droplet detection methods

For this exercise, use the full SingleCellExperiment object that hasn’t been processed for empty droplets.

sce.full <- read10xCounts(fname, col.names=TRUE)

Call empty drops with the algorithm from the CellRanger pipeline insetad of emptyDrops() as used above. The algorithm from CellRanger can be called using the defaultDrops() function.

# your code here
d.out <- defaultDrops(counts(sce.full))

Compare the empty drop calls between the two methods (using FDR 0.001 for emptyDrops) with a 2x2 table.

# your code here
table(d.out, e.out$FDR <= 0.001 & !is.na(e.out$FDR))
##        
## d.out    FALSE   TRUE
##   FALSE 732624    360
##   TRUE     423   3873

Plot a histogram of the total counts for the cells called by emptyDrops and not by CellRanger, and vise versa. Which tend to have higher total counts?

#your code here
cat <- rep(NA, length(d.out))
cat[!d.out & (e.out$FDR <= 0.001 & !is.na(e.out$FDR))] <- "emptyDrop only"
cat[d.out & !(e.out$FDR <= 0.001 & !is.na(e.out$FDR))] <- "default only"

library(dplyr)
data.frame(cat=cat, total=bcrank$total) %>%
  filter(!is.na(cat)) %>%
  ggplot(aes(x=total, fill=cat)) + 
  geom_histogram(bins=50)

11.4 Count-depth relationship

Pick a gene with high average expression. Plot the relationship between sequencing depth (size factors) and count of that gene.

# your code here
plot(sizeFactors(sce), counts(sce)[which.max(rowMeans(counts(sce))) ,]+1)

11.5 Highly variable genes without transformation

What would be the result of selecting the most highly variable genes without log transformation? Plot the distribution of expression values for the genes with the largest (untransformed) variance.

# your code here
chosen.genes.raw <- order(rowVars(as.matrix(counts(sce))),
                          decreasing=TRUE)[1:10]
plotExpression(sce, features=rownames(sce)[chosen.genes.raw]) 


12 Session Information

sessionInfo()
## R version 3.6.0 (2019-04-26)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS Mojave 10.14.5
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] parallel  stats4    stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] dplyr_0.8.2                 BiocSingular_1.1.4         
##  [3] scran_1.13.8                EnsDb.Hsapiens.v86_2.99.0  
##  [5] ensembldb_2.9.2             AnnotationFilter_1.9.0     
##  [7] GenomicFeatures_1.37.3      AnnotationDbi_1.47.0       
##  [9] scater_1.13.8               ggplot2_3.2.0              
## [11] DropletUtils_1.5.4          SingleCellExperiment_1.7.0 
## [13] SummarizedExperiment_1.15.5 DelayedArray_0.11.2        
## [15] BiocParallel_1.19.0         matrixStats_0.54.0         
## [17] Biobase_2.45.0              GenomicRanges_1.37.14      
## [19] GenomeInfoDb_1.21.1         IRanges_2.19.10            
## [21] S4Vectors_0.23.17           BiocGenerics_0.31.4        
## [23] BiocFileCache_1.9.1         dbplyr_1.4.2               
## [25] BiocStyle_2.13.2           
## 
## loaded via a namespace (and not attached):
##  [1] Rtsne_0.15               ggbeeswarm_0.6.0        
##  [3] colorspace_1.4-1         dynamicTreeCut_1.63-1   
##  [5] XVector_0.25.0           BiocNeighbors_1.3.2     
##  [7] bit64_0.9-7              R.methodsS3_1.7.1       
##  [9] knitr_1.23               Rsamtools_2.1.2         
## [11] R.oo_1.22.0              pheatmap_1.0.12         
## [13] HDF5Array_1.13.2         BiocManager_1.30.4      
## [15] compiler_3.6.0           httr_1.4.0              
## [17] dqrng_0.2.1              assertthat_0.2.1        
## [19] Matrix_1.2-17            lazyeval_0.2.2          
## [21] limma_3.41.6             htmltools_0.3.6         
## [23] prettyunits_1.0.2        tools_3.6.0             
## [25] rsvd_1.0.1               igraph_1.2.4.1          
## [27] gtable_0.3.0             glue_1.3.1              
## [29] GenomeInfoDbData_1.2.1   rappdirs_0.3.1          
## [31] Rcpp_1.0.1               Biostrings_2.53.0       
## [33] rtracklayer_1.45.1       DelayedMatrixStats_1.7.1
## [35] xfun_0.8                 stringr_1.4.0           
## [37] ps_1.3.0                 irlba_2.3.3             
## [39] statmod_1.4.32           XML_3.98-1.20           
## [41] edgeR_3.27.6             zlibbioc_1.31.0         
## [43] scales_1.0.0             hms_0.4.2               
## [45] ProtGenerics_1.17.2      rhdf5_2.29.0            
## [47] RColorBrewer_1.1-2       yaml_2.2.0              
## [49] curl_3.3                 memoise_1.1.0           
## [51] gridExtra_2.3            biomaRt_2.41.5          
## [53] stringi_1.4.3            RSQLite_2.1.1           
## [55] highr_0.8                simpleSingleCell_1.9.8  
## [57] rlang_0.4.0              pkgconfig_2.0.2         
## [59] bitops_1.0-6             evaluate_0.14           
## [61] lattice_0.20-38          purrr_0.3.2             
## [63] Rhdf5lib_1.7.2           GenomicAlignments_1.21.4
## [65] labeling_0.3             cowplot_0.9.4           
## [67] bit_1.1-14               tidyselect_0.2.5        
## [69] processx_3.3.1           magrittr_1.5            
## [71] bookdown_0.11            R6_2.4.0                
## [73] DBI_1.0.0                pillar_1.4.2            
## [75] withr_2.1.2              RCurl_1.95-4.12         
## [77] tibble_2.1.3             crayon_1.3.4            
## [79] rmarkdown_1.13           viridis_0.5.1           
## [81] progress_1.2.2           locfit_1.5-9.1          
## [83] grid_3.6.0               blob_1.1.1              
## [85] callr_3.2.0              digest_0.6.19           
## [87] R.utils_2.9.0            openssl_1.4             
## [89] munsell_0.5.0            beeswarm_0.2.3          
## [91] viridisLite_0.3.0        vipor_0.4.5             
## [93] askpass_1.1

References

Bach, K., S. Pensa, M. Grzelak, J. Hadfield, D. J. Adams, J. C. Marioni, and W. T. Khaled. 2017. “Differentiation dynamics of mammary epithelial cells revealed by single-cell RNA sequencing.” Nat Commun 8 (1): 2128.

Dahlin, J. S., F. K. Hamey, B. Pijuan-Sala, M. Shepherd, W. W. Y. Lau, S. Nestorowa, C. Weinreb, et al. 2018. “A single-cell hematopoietic landscape resolves 8 lineage trajectories and defects in Kit mutant mice.” Blood 131 (21): e1–e11.

Kang, H. M., M. Subramaniam, S. Targ, M. Nguyen, L. Maliskova, E. McCarthy, E. Wan, et al. 2018. “Multiplexed droplet single-cell RNA-sequencing using natural genetic variation.” Nat. Biotechnol. 36 (1): 89–94.

Klein, A. M., L. Mazutis, I. Akartuna, N. Tallapragada, A. Veres, V. Li, L. Peshkin, D. A. Weitz, and M. W. Kirschner. 2015. “Droplet barcoding for single-cell transcriptomics applied to embryonic stem cells.” Cell 161 (5): 1187–1201.

Lun, A., S. Riesenfeld, T. Andrews, T. P. Dao, T. Gomes, participants in the 1st Human Cell Atlas Jamboree, and J. Marioni. 2018. “Distinguishing Cells from Empty Droplets in Droplet-Based Single-Cell Rna Sequencing Data.” bioRxiv. Cold Spring Harbor Laboratory. https://doi.org/10.1101/234872.

Lun, A. T., K. Bach, and J. C. Marioni. 2016. “Pooling across cells to normalize single-cell RNA sequencing data with many zero counts.” Genome Biol. 17 (April): 75.

Macosko, E. Z., A. Basu, R. Satija, J. Nemesh, K. Shekhar, M. Goldman, I. Tirosh, et al. 2015. “Highly parallel genome-wide expression profiling of individual cells using nanoliter droplets.” Cell 161 (5): 1202–14.

McCarthy, D. J., K. R. Campbell, A. T. Lun, and Q. F. Wills. 2017. “Scater: pre-processing, quality control, normalization and visualization of single-cell RNA-seq data in R.” Bioinformatics 33 (8): 1179–86.

Stoeckius, M., S. Zheng, B. Houck-Loomis, S. Hao, B. Yeung, P. Smibert, and R. Satija. 2017. “Cell "Hashing" with Barcoded Antibodies Enables Multiplexing and Doublet Detection for Single Cell Genomics.” bioRxiv. Cold Spring Harbor Laboratory. https://doi.org/10.1101/237693.

Van der Maaten, L., and G. Hinton. 2008. “Visualizing Data Using T-SNE.” J. Mach. Learn. Res. 9 (2579-2605): 85.

Xu, C., and Z. Su. 2015. “Identification of cell types from single-cell transcriptomes using a novel clustering method.” Bioinformatics 31 (12): 1974–80.

Zeisel, A., A. B. Munoz-Manchado, S. Codeluppi, P. Lonnerberg, G. La Manno, A. Jureus, S. Marques, et al. 2015. “Brain structure. Cell types in the mouse cortex and hippocampus revealed by single-cell RNA-seq.” Science 347 (6226): 1138–42.

Zheng, G. X., J. M. Terry, P. Belgrader, P. Ryvkin, Z. W. Bent, R. Wilson, S. B. Ziraldo, et al. 2017. “Massively parallel digital transcriptional profiling of single cells.” Nat Commun 8 (January): 14049.